Abstract

Criteria for prediction of multinomial responses are examined in terms of estimation bias. 
Logarithmic penalty and least squares are quite similar in behavior but quite different from 
maximum probability. The differences ultimately reflect deficiencies in the behavior of the 
criterion of maximum probability. 
Introduction

In statistical applications, it is common to predict a polytomous variable Y by use 
of one or more continuous or discrete variables. Such applications are encountered in the 
study of educational testing. A response to a multiple-choice item is a polytomous variable; 
commonly used holistic scores in grading of essays are polytomous, as are many validity 
criteria such as whether graduation occurs. Polytomous responses may or may not have obvious
associated numerical values. A simple numerical description is not appropriate for the 
response to a multiple choice question, for there may be a correct response, there may be 
three different incorrect responses, no response at all may exist, or there may be an invalid 
response in which more than one choice is marked. On the other hand, a holistic essay 
score may be a number from 1 to 6, with a higher number indicating a better response. 
Especially in cases in which no appropriate numerical values correspond to the values of Y, 
there are basic questions concerning the meaning of a prediction of Y. Given a definition 
of a prediction of Y, there is then the problem of criteria for evaluation of the prediction. 
These problems have been treated extensively in the statistical literature (Savage, 1971; 
Goodman & Kruskal, 1954; Haberman, 1982a; Haberman, 1982b; Gilula & Haberman, 
1995b); nonetheless, it is not often appreciated that the appropriate strategy for prediction 
depends quite strongly on the criterion used to assess the quality of the prediction. In 
Section 1, prediction criteria based on penalty functions are developed (Haberman, 1982a; 
Haberman, 1982b), and the criteria based on squared error penalty, logarithmic probability 
penalty, and misclassification penalty are introduced. In Section 2, samples are used to 
develop probability predictions (Haberman, 1982a; Haberman, 1982b; Gilula & Haberman, 
1995b). In Section 3, some large-sample results are used in a simple case to show that 
criteria based on misclassification penalty are much different in nature than criteria based 
on squared error penalty or on logarithmic penalty. This comparison appears to be new. 
Section 4 examines consequences of the results derived. The most important conclusion is 
that use of classification error rates is a questionable approach despite its intuitive appeal. 
Table 1.

Joint Probabilities of Human and Machine Scores

Machine                        Human score
score         1        2        3        4        5        6     Total

1        0.0080   0.0120   0.0010   0.0010   0.0005   0.0005    0.0230
2        0.0240   0.0440   0.0760   0.0160   0.0010   0.0010    0.1620
3        0.0040   0.0440   0.1000   0.0760   0.0040   0.0010    0.2290
4        0.0010   0.0080   0.0960   0.1760   0.0680   0.0010    0.3500
5        0.0010   0.0040   0.0040   0.0400   0.0800   0.0480    0.1770
6        0.0005   0.0005   0.0010   0.0010   0.0240   0.0320    0.0590

Total    0.0385   0.1125   0.2780   0.3100   0.1775   0.0835    1.0000

1 Penalty Functions 

To examine the problem of prediction of polytomous variables, consider a polytomous
response random variable $Y$ with values in a finite set $S = \{y_i : 1 \le i \le s\}$ with $s$
elements and an explanatory random variable $X$ with values in some space $T$. For instance,
in the case of a holistic essay score from 1 to 6, $S$ is just the set of integers from 1 to 6
and each $y_i$ is the integer $i$. The variable $Y$ might be the final holistic score obtained
from human raters for a randomly selected essay for a specific prompt, and $X$ might be the
machine-derived holistic score for the same essay. For illustrative purposes, a joint
distribution of $X$ and $Y$ is provided in Table 1. These probabilities are comparable with
reported sample data (Feng et al., 2003). In this case, $S$ and $T$ are the same. Let
$p_{Y \cdot X}(y|x)$ be the conditional probability that $Y = y$ given that $X = x$ for $x$ in $T$
and $y$ in $S$, and let $p_Y(y)$ be the marginal probability that $Y = y$. In the essay example,
$p_{Y \cdot X}(3|3) = 0.1000/0.2290 = 0.4367$ is the conditional probability of a human score of
3 given a machine score of 3, and $p_Y(3) = 0.2780$ is the marginal probability of a human
score of 3. The conditional distribution of $Y$ given $X$ is fully described by the conditional
probability vectors $\mathbf{p}_{Y \cdot X}(x)$ with coordinates $p_{Y \cdot X}(y_i|x)$ for integers
$i$ from 1 to $s$. Similarly, the marginal distribution of $Y$ is completely described by the
$s$-dimensional vector $\mathbf{p}_Y$ with coordinates $p_Y(y_i)$ for integers $i$ from 1 to $s$.
In this paper, basic operations on $s$-dimensional vectors will be common. If $\mathbf{c}$ is an
$s$-dimensional vector with coordinates $c_i$ for $1 \le i \le s$, then the squared Euclidean
norm of $\mathbf{c}$ is defined by

\[ |\mathbf{c}|^2 = \sum_{i=1}^{s} c_i^2, \]

and the maximum norm of $\mathbf{c}$ is defined by

\[ |\mathbf{c}|_\infty = \max_{1 \le i \le s} |c_i|. \]

The tie function $t(\mathbf{c})$ is the number of members of the set $A(\mathbf{c})$ of integers
$i$ from 1 to $s$ such that $|c_i| = |\mathbf{c}|_\infty$. At times, infinite quantities must be
considered. The convention is adopted that $\log 0 = -\infty$ and $0 \cdot \infty = 0$. The
variance of a random variable $V$ is denoted by $\sigma^2(V)$.

To examine sampling, let $X_h$ and $Y_h$, $1 \le h \le n$, be sampled random variables such
that the pairs $(X_h, Y_h)$, $1 \le h \le n$, are mutually independent, independent of $(X, Y)$,
and distributed as $(X, Y)$. Thus in the essay example, there would be $n$ observed essays. For
essay $h$, $X_h$ would be the machine-derived holistic score, and $Y_h$ would be the human essay
score.

The basic problem under study is prediction of the response variable $Y$ by the
explanatory variable $X$. As previously noted, because $Y$ is polytomous and the set of
possible values $S$ may not have a useful numerical representation, it is not necessarily
appropriate to approximate $Y$ by a single numerical value. On the other hand, $Y$ always
has an $s$-dimensional vector representation $\mathbf{Z}$ with coordinates $Z_i$ for
$1 \le i \le s$. Let $\boldsymbol{\delta}_i$ be the $s$-dimensional vector with coordinates
$\delta_{ij}$, $1 \le j \le s$, such that $\delta_{ij} = 1$ if $i = j$, and $\delta_{ij} = 0$ if
$i \ne j$. If $Y = y_i$, then $\mathbf{Z} = \boldsymbol{\delta}_i$. Obviously $Y$ determines
$\mathbf{Z}$. Given $\mathbf{Z}$, $Y$ is identified as the value $y_i$ such that $Z_i = 1$. For
each integer $i$, the coordinate $Z_i$ is always nonnegative, and the sum
$\sum_{i=1}^{s} Z_i = 1$. Let a superscript $T$ denote a transpose. Then in the case of
holistic scoring, $\mathbf{Z}$ is the six-dimensional vector $(0,0,1,0,0,0)^T$ if the holistic
score $Y$ is 3. The unconditional expected value of $\mathbf{Z}$ is $\mathbf{p}_Y$, and
$\mathbf{p}_{Y \cdot X}(x)$ is the conditional expected value of $\mathbf{Z}$ given $X = x$.
Naturally, $p_Y(y)$ and $p_{Y \cdot X}(y|x)$ are both nonnegative, and

\[ \sum_{i=1}^{s} p_Y(y_i) = \sum_{i=1}^{s} p_{Y \cdot X}(y_i|x) = 1. \]
Thus the observation $\mathbf{Z}$ and the probability vectors $\mathbf{p}_Y$ and
$\mathbf{p}_{Y \cdot X}(x)$ are all members of the unit simplex $Q$ of $s$-dimensional vectors
$\mathbf{a}$ with nonnegative coordinates $a_i$ and sum $\sum_{i=1}^{s} a_i = 1$, and
$\mathbf{Z}$ is always a vertex of $Q$. In this paper, probability predictors are studied.
Here a probability predictor $\mathbf{q}(X)$ is a random vector such that $\mathbf{q}$ is a
function from the space $T$ of values of $X$ to the unit simplex $Q$ (Savage, 1971; Haberman,
1982a; Haberman, 1982b). In this fashion, $\mathbf{p}_{Y \cdot X}(X)$ is a probability predictor,
as is the constant predictor $\mathbf{p}_{YC}$ equal to $\mathbf{p}_Y$ for any possible value of
$X$. One may regard a probability predictor as an approximation of $\mathbf{Z}$. Because the
original observation $Y$ is a one-to-one function of the vector $\mathbf{Z}$, a probability
predictor also provides a type of prediction of $Y$.

To study probability prediction, accuracy of prediction must be considered. Several
common approaches exist that can be described within a common framework (Savage, 1971;
Haberman, 1982a; Haberman, 1982b) based on a nonnegative and possibly infinite penalty
function $L$ designed to measure the discrepancy between the observed vector $\mathbf{Z}$ and
the probability predictor $\mathbf{q}(X)$. For any value $y$ of $Y$ and any member $\mathbf{a}$
of the unit simplex, $L$ assumes a value $L(y, \mathbf{a})$. The observed penalty is
$L(y, \mathbf{q}(x))$ if $Y = y$ and $X = x$. In this report, the three penalty functions
considered are the squared error penalty function $L_C$ defined by

\[ L_C(y_i, \mathbf{a}) = |\boldsymbol{\delta}_i - \mathbf{a}|^2, \]

the logarithmic penalty function $L_H$ defined by

\[ L_H(y_i, \mathbf{a}) = -\log a_i, \]

and the misclassification penalty function $L_M$ defined by

\[ L_M(y_i, \mathbf{a}) = \begin{cases} 1, & i \notin A(\mathbf{a}), \\ 1 - 1/t(\mathbf{a}), & i \in A(\mathbf{a}). \end{cases} \]


Thus the squared error penalty $L_C(Y, \mathbf{a})$ is the squared Euclidean distance between
the observed vector $\mathbf{Z}$ and the probability prediction $\mathbf{a}$. If $Y = y_i$, then
the logarithmic probability penalty $L_H(Y, \mathbf{a}) = -\log a_i$ is minus the logarithm of
the probability $a_i$ predicted for the observed value $y_i$ of $Y$. The misclassification rate
penalty $L_M(Y, \mathbf{a})$ is based on the idea of classification of $Y$ as the value $y_i$
with the highest probability $a_i$. In this fashion, $L_M(Y, \mathbf{a})$ is 1 if the observed
value $y_i$ of $Y$ has assigned probability $a_i$ less than the maximum probability
$|\mathbf{a}|_\infty$ assigned to a value of $Y$, so that $Y$ is not classified correctly by the
classification rule. If the assigned probability $a_i$ for $Y = y_i$ is larger than any other
probability $a_j$ assigned to $Y = y_j \ne y_i$, then $L_M(Y, \mathbf{a})$ is 0, for $Y$ is
classified correctly by the classification rule. The case of $t(\mathbf{a}) > 1$ and $i$ in
$A(\mathbf{a})$ is a bit more complicated. In this instance, classification is ambiguous, so
that $Y$ is randomly classified with probability $1/t(\mathbf{a})$ as $y_j$ if $j$ is in
$A(\mathbf{a})$. For $Y = y_i$, the probability of incorrect classification is then
$1 - 1/t(\mathbf{a})$.

As an example, consider the case of human essay scoring. Let the actual score be 3,
and let a probability predictor assign probability 0.1 to scores 1 and 6, probability 0.15
to scores 2 and 5, and probability 0.25 to scores 3 and 4. Here the maximum predicted
probability 0.25 is assigned to both scores 3 and 4. The squared error penalty is

\[ (0 - 0.1)^2 + (0 - 0.15)^2 + (1 - 0.25)^2 + (0 - 0.25)^2 + (0 - 0.15)^2 + (0 - 0.1)^2 = 0.69, \]

the logarithmic penalty is

\[ -\log(0.25) = 1.386, \]

and the misclassification rate penalty is

\[ 1 - 1/2 = 0.5. \]
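
The three penalties in this example can be checked numerically. The following is a minimal sketch in Python; the function names and the zero-based index for the observed score are illustrative choices, not notation from the report.

```python
import math

def squared_error_penalty(i, a):
    """Squared Euclidean distance between the indicator vector of outcome i and a."""
    return sum(((1.0 if j == i else 0.0) - a_j) ** 2 for j, a_j in enumerate(a))

def log_penalty(i, a):
    """Minus the logarithm of the probability assigned to the observed outcome i."""
    return math.inf if a[i] == 0 else -math.log(a[i])

def misclassification_penalty(i, a, tol=1e-12):
    """1 if outcome i misses the maximum probability; otherwise 1 - 1/t(a)."""
    top = max(a)
    ties = [j for j, a_j in enumerate(a) if abs(a_j - top) <= tol]
    return 1.0 - 1.0 / len(ties) if i in ties else 1.0

# Predicted probabilities for scores 1..6; the observed score 3 has index 2.
a = [0.10, 0.15, 0.25, 0.25, 0.15, 0.10]
observed = 2

print(round(squared_error_penalty(observed, a), 4))  # squared error penalty
print(round(log_penalty(observed, a), 3))            # logarithmic penalty
print(misclassification_penalty(observed, a))        # misclassification penalty
```

The tolerance `tol` used to detect ties is an assumption of the sketch; exact ties in floating-point probabilities may otherwise be missed.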

The penalty function $L$ satisfies the regularity conditions that the penalty
$L(Y, \mathbf{a})$ is finite if the probability $a_i$ assigned to the outcome $y_i$ is positive
and if $Y = y_i$. It is also assumed that the penalty $L(Y, \mathbf{a})$ is 0 if, and only if,
the probability $a_i$ assigned to the outcome $y_i$ is 1 and $Y = y_i$, so that
$\mathbf{a} = \mathbf{Z}$. These requirements hold if $L$ is $L_C$, $L_H$, or $L_M$.
In addition, $L_C$ never exceeds 2, and $L_M$ never exceeds 1.

The fundamental assumption is that the smallest expected penalty from use of a
constant prediction function is observed if the prediction function is $\mathbf{p}_{YC}$. For
members $\mathbf{a}$ and $\mathbf{b}$ of the unit simplex, let

\[ D^*(\mathbf{a}, \mathbf{b}) = \sum_{i=1}^{s} a_i L(y_i, \mathbf{b}) \]

be the expected penalty $E(L(Y, \mathbf{q}(X)))$ if $\mathbf{p}_Y = \mathbf{a}$ and if
$\mathbf{q}(x)$ is $\mathbf{b}$ for all $x$ in $T$. Obviously the expected penalty
$D^*(\mathbf{a}, \mathbf{b})$ is nonnegative. If $b_i$ is positive whenever $a_i$ is positive,
then $D^*(\mathbf{a}, \mathbf{b})$ is finite. Unless
$\mathbf{a} = \mathbf{b} = \boldsymbol{\delta}_i$ for some possible value $y_i$ of the dependent
variable $Y$, $D^*(\mathbf{a}, \mathbf{b})$ is positive. For a given $\mathbf{a}$, it is assumed
that the expected penalty $D^*(\mathbf{a}, \mathbf{b})$ is smallest if
$\mathbf{b} = \mathbf{a}$, so that

\[ D(\mathbf{a}) = D^*(\mathbf{a}, \mathbf{a}) \le D^*(\mathbf{a}, \mathbf{b}), \]

and $D(\mathbf{a})$ is nonnegative and finite. These requirements hold if the penalty function
$L$ is $L_C$, $L_H$, or $L_M$ (Haberman, 1982a; Haberman, 1982b). Rather remarkably, these
requirements provide a justification for use of logarithmic probability. If the number $s$ of
possible values of $Y$ is at least 3, then a penalty function $L$ that satisfies all regularity
conditions and satisfies the condition that $L(y_i, \mathbf{a})$ is determined by $a_i$ for
each $i$ from 1 to $s$ must be equal to $dL_H$ for some positive real $d$ (Savage, 1971).

The function $D$ is used to define basic measures of dispersion and association
(Haberman, 1982a; Haberman, 1982b). The unconditional dispersion measure $J_Y$ is defined
to be the expected penalty $D(\mathbf{p}_Y) = E(L(Y, \mathbf{p}_{YC}))$ from probability
prediction of $Y$ by the constant predictor $\mathbf{p}_{YC}$. This measure is nonnegative and
finite, and $J_Y$ is 0 if, and only if, $p_Y(y) = 1$ for some possible value $y$ of $Y$, so that
$Y = y$ and $\mathbf{Z} = \mathbf{p}_Y$ with probability 1. In this latter case, $Y$ is said to
be essentially constant.

The dispersion measures associated with squared error penalty and logarithmic penalty
are commonly used. In the case of the squared error function $L_C$, $D(\mathbf{p}_Y)$ is the
Gini concentration

\[ C_Y = 1 - \sum_{y \in S} [p_Y(y)]^2 \]

(Gini, 1912). It should be noted that $C_Y$ is simply the probability that the sample variables
$Y_1$ and $Y_2$ satisfy $Y_1 \ne Y_2$. Thus in the case of essay scoring, $C_Y$ is the
probability that two randomly chosen essays have different human holistic scores. In the case
of log penalty, $D(\mathbf{p}_Y)$ is the Shannon entropy

\[ H_Y = -\sum_{y \in S} p_Y(y) \log p_Y(y) \]

(Shannon, 1948). For misclassification rate, $D(\mathbf{p}_Y)$ is the minimum classification
error rate

\[ M_Y = 1 - \max_{y \in S} p_Y(y) \]

obtained if $Y$ is classified as having a constant value $y$ without regard to $X$. Because

\[ \sum_{y \in S} [p_Y(y)]^2 \le \max_{y \in S} p_Y(y) \sum_{y \in S} p_Y(y) = \max_{y \in S} p_Y(y), \]

with equality only if some $p_Y(y)$ is 1, the minimum classification error rate $M_Y$ never
exceeds the concentration $C_Y$, $C_Y \le 1 - s^{-1}$, and $M_Y < C_Y$ if $Y$ is not essentially
constant (Gilula & Haberman, 1995a; Goodman & Kruskal, 1954). Because the logarithm of a
positive real number $d$ never exceeds $d - 1$, and $\log(d)$ only equals $d - 1$ if $d = 1$,

\[ H_Y \ge \sum_{y \in S} p_Y(y)[1 - p_Y(y)] = C_Y, \]

with equality only if $Y$ is essentially constant. In addition, $H_Y \le \log(s)$ (Gilula &
Haberman, 1995a). In the case of the probabilities for essay scores in Table 1, the best
strategy for classification is to classify all essays by the score 4. The misclassification rate
$M_Y$ is then $1 - 0.3100 = 0.6900$. As expected, the concentration $C_Y = 0.7740$ exceeds
$M_Y = 0.6900$ and $C_Y$ is less than $1 - 6^{-1} = 0.8333$. In addition, the entropy
$H_Y = 1.6043$ exceeds the concentration $C_Y$ and is less than $\log(6) = 1.7918$.
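
The unconditional dispersion measures for the essay example can be recomputed from the human-score marginals of Table 1. The sketch below is illustrative; the function names are assumptions rather than notation from the report, and natural logarithms are used, as the reported value log(6) = 1.7918 indicates.

```python
import math

# Marginal probabilities of human scores 1..6 (column totals of Table 1).
p_y = [0.0385, 0.1125, 0.2780, 0.3100, 0.1775, 0.0835]

def misclassification_rate(p):
    """M_Y: error rate when every case is classified as the most probable value."""
    return 1.0 - max(p)

def gini_concentration(p):
    """C_Y: probability that two independent draws differ."""
    return 1.0 - sum(q * q for q in p)

def shannon_entropy(p):
    """H_Y with the convention 0 * log 0 = 0."""
    return -sum(q * math.log(q) for q in p if q > 0)

m_y = misclassification_rate(p_y)
c_y = gini_concentration(p_y)
h_y = shannon_entropy(p_y)

print(f"M_Y = {m_y:.4f}, C_Y = {c_y:.4f}, H_Y = {h_y:.4f}")
# Ordering stated in the text: M_Y <= C_Y <= H_Y <= log(s).
print(m_y <= c_y <= h_y <= math.log(len(p_y)))
```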

The conditional dispersion measure $J_{Y \cdot X}(x)$ of $Y$ given $X = x$ is the dispersion
$D(\mathbf{p}_{Y \cdot X}(x))$. The conditional dispersion measure $J_{Y \cdot X}$ is the
expected penalty

\[ E(J_{Y \cdot X}(X)) = E(L(Y, \mathbf{p}_{Y \cdot X}(X))) \]

from use of the conditional probability prediction $\mathbf{p}_{Y \cdot X}(X)$ for $Y$. This
measure is the smallest possible expected penalty $E(L(Y, \mathbf{q}(X)))$ for prediction of
$Y$ by a probability predictor $\mathbf{q}(X)$. Because $\mathbf{p}_{YC}$ is a probability
predictor, $0 \le J_{Y \cdot X} \le J_Y$. The conditional dispersion $J_{Y \cdot X}$ is 0 if, and
only if, for some function $c$ from $T$ to $S$, $Y = c(X)$ with probability 1. Thus $Y$ may be
said to be essentially determined by $X$. At the other extreme, the conditional and
unconditional dispersions $J_{Y \cdot X}$ and $J_Y$ are the same if $X$ and $Y$ are independent,
so that $\mathbf{p}_{Y \cdot X}(X) = \mathbf{p}_Y$ with probability 1.
If the squared error penalty function is used, then the conditional dispersion
$J_{Y \cdot X}(x)$ given $X = x$ is the conditional concentration

\[ C_{Y \cdot X}(x) = 1 - \sum_{y \in S} [p_{Y \cdot X}(y|x)]^2 \]

of $Y$ given $X = x$, and the conditional concentration of $Y$ given $X$ is

\[ C_{Y \cdot X} = E(C_{Y \cdot X}(X)) = 1 - \sum_{y \in S} E([p_{Y \cdot X}(y|X)]^2). \]

Thus $C_{Y \cdot X}$ is the conditional probability that $Y_1 \ne Y_2$ given that $X_1 = X_2$.
If the logarithmic penalty is employed, then $J_{Y \cdot X}(x)$ is the conditional entropy

\[ H_{Y \cdot X}(x) = -\sum_{y \in S} p_{Y \cdot X}(y|x) \log p_{Y \cdot X}(y|x) \]

of $Y$ given $X = x$, and

\[ H_{Y \cdot X} = E(H_{Y \cdot X}(X)) = -\sum_{y \in S} E(p_{Y \cdot X}(y|X) \log p_{Y \cdot X}(y|X)) \]

is the conditional entropy of $Y$ given $X$. For misclassification rate, $J_{Y \cdot X}(x)$ is
the minimum classification error rate

\[ M_{Y \cdot X}(x) = 1 - \max_{y \in S} p_{Y \cdot X}(y|x) \]

for $Y$ given $X = x$, and $J_{Y \cdot X}$ is the minimum classification error rate

\[ M_{Y \cdot X} = E(M_{Y \cdot X}(X)) = 1 - E(\max_{y \in S} p_{Y \cdot X}(y|X)) \]

from classification of $Y$ by use of a function of $X$. As in the unconditional case,
$M_{Y \cdot X}(x) \le C_{Y \cdot X}(x) \le H_{Y \cdot X}(x)$, and
$M_{Y \cdot X} \le C_{Y \cdot X} \le H_{Y \cdot X}$. For example, in the case of essay scoring,
$C_{Y \cdot X} = 0.6266$ exceeds $M_{Y \cdot X} = 0.524$ and is less than
$H_{Y \cdot X} = 1.1851$.

If the expected penalty function $D$ is strictly concave, then the conditional dispersion
measure $J_{Y \cdot X}$ is equal to the unconditional dispersion measure $J_Y$ only if $X$ and
$Y$ are independent. In the case of squared error penalty and logarithmic penalty, the function
$D$ is strictly concave. Thus $C_{Y \cdot X} = C_Y$ or $H_{Y \cdot X} = H_Y$ implies independence
of the dependent variable $Y$ and the independent variable $X$. For misclassification rate, $D$
is not strictly concave, and the conditional classification error rate $M_{Y \cdot X}$ may equal
the unconditional classification error rate $M_Y$ without independence of $X$ and $Y$. For an
extreme case, let $Y$ and $X$ have possible values 1 and 2, so that $s$ is 2 and $S$ and $T$ are
the integers 1 and 2. Let the probability $p_X(x)$ that $X = x$ be 0.5 for $x$ equal to 1 or 2,
let $p_{Y \cdot X}(1|1) = 1$, $p_{Y \cdot X}(2|1) = 0$, and
$p_{Y \cdot X}(1|2) = p_{Y \cdot X}(2|2) = 0.5$, so that $p_Y(1) = 0.75$ and $p_Y(2) = 0.25$.
Obviously $X$ and $Y$ are dependent; however, $M_{Y \cdot X} = M_Y = 0.25$. In contrast,
$C_{Y \cdot X} = 0.25$ is less than $C_Y = 0.375$, and $H_{Y \cdot X} = 0.3466$ is less than
$H_Y = 0.5623$.
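
The two-by-two example can be verified directly from its joint distribution. The following sketch is illustrative; the layout of `joint` (rows indexed by x, columns by y) and the helper name `dispersions` are assumptions made here for the purpose of the check.

```python
import math

def dispersions(p):
    """Return (M, C, H) for a probability vector p."""
    m = 1.0 - max(p)
    c = 1.0 - sum(q * q for q in p)
    h = -sum(q * math.log(q) for q in p if q > 0)  # 0 * log 0 is taken as 0
    return m, c, h

# Joint probabilities p(y, x); rows are x = 1, 2 and columns are y = 1, 2.
joint = [[0.50, 0.00],
         [0.25, 0.25]]

p_x = [sum(row) for row in joint]
p_y = [sum(col) for col in zip(*joint)]

unconditional = dispersions(p_y)

# Conditional measures are averages of the per-x measures weighted by p_X(x).
conditional = [0.0, 0.0, 0.0]
for weight, row in zip(p_x, joint):
    for k, value in enumerate(dispersions([q / weight for q in row])):
        conditional[k] += weight * value

for name, j_y, j_yx in zip(("M", "C", "H"), unconditional, conditional):
    print(f"{name}: unconditional {j_y:.4f}, conditional {j_yx:.4f}")
```

Only the misclassification measure fails to drop when X is used, which is the point of the example.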

The dispersion measures described in this section are commonly used to construct
analogues of the coefficient of determination of regression analysis to describe the strength
of the relationship between $Y$ and $X$. This practice is particularly well known if $T$ is
finite, so that $X$ is polytomous (Goodman & Kruskal, 1954). If $J_Y > 0$, then

\[ \rho_{Y \cdot X} = 1 - \frac{J_{Y \cdot X}}{J_Y} \]

measures the proportional reduction in loss from use of $X$ as a predictor of $Y$. Given the
inequality constraints on the conditional dispersion $J_{Y \cdot X}$ and the unconditional
dispersion $J_Y$, it follows that $0 \le \rho_{Y \cdot X} \le 1$, with $\rho_{Y \cdot X} = 1$ if,
and only if, $Y$ is essentially determined by $X$. If $X$ and $Y$ are independent, then
$\rho_{Y \cdot X}$ is 0. If $D$ is strictly concave, then $\rho_{Y \cdot X}$ is only 0 if $X$ and
$Y$ are independent.

The Goodman and Kruskal $\lambda$ coefficient $\lambda_{Y \cdot X}$ is
$1 - M_{Y \cdot X}/M_Y$, and the Goodman and Kruskal $\tau$ coefficient $\tau_{Y \cdot X}$ is
$1 - C_{Y \cdot X}/C_Y$ (Goodman & Kruskal, 1954). The Theil uncertainty coefficient
$U_{Y \cdot X}$ is $1 - H_{Y \cdot X}/H_Y$ (Theil, 1971). As evident from the relationships
between independence and equality of conditional and unconditional dispersion measures,
$\tau_{Y \cdot X}$ and $U_{Y \cdot X}$ are only 0 if $X$ and $Y$ are independent. In contrast,
$\lambda_{Y \cdot X}$ may be 0 for dependent $X$ and $Y$. In the example used previously to
illustrate the possibility of dependence with equal conditional and unconditional
classification error rates, $\lambda_{Y \cdot X} = 0$,
$\tau_{Y \cdot X} = 1 - 0.25/0.375 = 0.3333$, and $U_{Y \cdot X} = 0.3837$. Note that
$\tau_{Y \cdot X}$ and $U_{Y \cdot X}$ are relatively similar, and they both suggest that an
appreciable reduction in error is achieved by use of $X$ in prediction of $Y$. On the other
hand, $\lambda_{Y \cdot X}$ suggests that $X$ has no value as a predictor of $Y$. In the example
of essay scoring, differences in results are less dramatic, for squared error penalty leads to
$\tau_{Y \cdot X} = 1 - 0.6266/0.7740 = 0.1904$, logarithmic probability penalty leads to
$U_{Y \cdot X} = 1 - 1.1851/1.6043 = 0.2613$, and misclassification error leads to
$\lambda_{Y \cdot X} = 1 - 0.5240/0.6900 = 0.2406$. In all cases, the indication is that the
machine-generated essay score permits a modest improvement in probability prediction
relative to the prediction achievable without the machine-generated score.
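
The proportional-reduction coefficients for essay scoring follow from the reported dispersion measures by simple arithmetic. The sketch below takes those reported values as inputs; the function name `proportional_reduction` is an illustrative choice.

```python
def proportional_reduction(j_conditional, j_unconditional):
    """1 - J_{Y.X} / J_Y, defined only when the unconditional dispersion is positive."""
    if j_unconditional <= 0:
        raise ValueError("unconditional dispersion must be positive")
    return 1.0 - j_conditional / j_unconditional

# (conditional, unconditional) dispersion pairs reported for the essay example.
essay = {
    "tau": (0.6266, 0.7740),     # squared error (concentration)
    "U": (1.1851, 1.6043),       # logarithmic penalty (entropy)
    "lambda": (0.5240, 0.6900),  # misclassification rate
}

coefficients = {name: proportional_reduction(*pair) for name, pair in essay.items()}
for name, value in coefficients.items():
    print(f"{name} = {value:.4f}")
```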


2 Sampling and Probability Prediction 

A probability prediction of $Y$ given $X$ may be developed by use of sample observations.
For some simple examples, for $y$ in $S$ and $x$ in $T$, let $f_Y(y)$ be the number of integers
$h$ with $Y_h = y$, let $f_{YX}(y, x)$ be the number of integers $h$ with $Y_h = y$ and
$X_h = x$, and let $f_X(x)$ be the number of integers $h$ with $X_h = x$. One might consider the
probability predictor $\hat{\mathbf{p}}_{YC}$ with coordinate $i$ equal to
$\hat{p}_Y(y_i) = n^{-1} f_Y(y_i)$, the fraction of the integers $h$ with $Y_h = y_i$. For a
slightly more complex case, consider $\hat{\mathbf{p}}_{Y \cdot X}$, where
$\hat{\mathbf{p}}_{Y \cdot X}(x)$ is $\hat{\mathbf{p}}_{YC}$ if $f_X(x) = 0$ and
$\hat{\mathbf{p}}_{Y \cdot X}(x)$ has coordinate $i$ equal to the relative frequency

\[ \hat{p}_{Y \cdot X}(y_i|x) = f_{YX}(y_i, x)/f_X(x) \]

if $f_X(x) > 0$. The functions $\hat{\mathbf{p}}_{Y \cdot X}$ and $\hat{\mathbf{p}}_{YC}$ have
the disadvantage that they can provide probability predictors that have coordinates equal to 0.
As a consequence, alternatives of interest are $\hat{\mathbf{p}}_{YC\alpha}$ and
$\hat{\mathbf{p}}_{Y \cdot X\alpha}$, where $\boldsymbol{\alpha}$ is an $s$-dimensional vector
with positive coordinates $\alpha_i$ with sum $\alpha_+ = \sum_{i=1}^{s} \alpha_i$. Here
$\hat{\mathbf{p}}_{YC\alpha}(x)$ has coordinate $i$ equal to

\[ \hat{p}_{YC\alpha}(y_i|x) = \frac{f_Y(y_i) + \alpha_i}{n + \alpha_+} \]

for any $x$ in $T$, and $\hat{\mathbf{p}}_{Y \cdot X\alpha}(x)$ has coordinate $i$ equal to

\[ \hat{p}_{Y \cdot X\alpha}(y_i|x) = \begin{cases} \dfrac{f_Y(y_i) + \alpha_i}{n + \alpha_+}, & f_X(x) = 0, \\[1ex] \dfrac{f_{YX}(y_i, x) + \alpha_i}{f_X(x) + \alpha_+}, & f_X(x) > 0. \end{cases} \]

It is possible to consider $\alpha_i = 0.5$. This choice is consistent with common estimation
procedures for logarithms of ratios of probabilities (Anscombe, 1956).
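
The smoothed conditional predictor can be sketched in a few lines. The code below is a minimal illustration, assuming a common constant for every coordinate of the smoothing vector (0.5 by default); the function name `smoothed_predictor` and the tiny sample are hypothetical and serve only to show the two branches of the definition.

```python
from collections import Counter

def smoothed_predictor(pairs, values, alpha=0.5):
    """Build x -> list of smoothed conditional probabilities over `values`.

    pairs  : iterable of observed (x_h, y_h) pairs
    values : the possible values of Y, in a fixed order
    alpha  : common positive constant added to every count (an assumption here)
    """
    pairs = list(pairs)
    values = list(values)
    n = len(pairs)
    f_y = Counter(y for _, y in pairs)
    f_x = Counter(x for x, _ in pairs)
    f_yx = Counter((y, x) for x, y in pairs)
    alpha_plus = alpha * len(values)

    def predict(x):
        if f_x[x] == 0:
            # No sampled case has this x: fall back on the smoothed marginal.
            return [(f_y[y] + alpha) / (n + alpha_plus) for y in values]
        return [(f_yx[(y, x)] + alpha) / (f_x[x] + alpha_plus) for y in values]

    return predict

# Hypothetical sample of (machine score, human score) pairs.
sample = [(3, 3), (3, 4), (4, 4), (2, 3)]
predict = smoothed_predictor(sample, range(1, 7))
print(predict(3))  # x = 3 was observed twice
print(predict(1))  # x = 1 was never observed
```

Every coordinate returned is positive, so the logarithmic penalty of such a predictor is always finite.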

In general, a function $\mathbf{q}$ of the observations $X_h$ and $Y_h$, $1 \le h \le n$, is
considered. For any given value of the $X_h$ and $Y_h$, $\mathbf{q}$ is a probability predictor
with value $\mathbf{q}(x)$ at $x$ in $T$, and $\mathbf{q}(x)$ has coordinate $i$ equal to
$q(y_i|x)$. The function $\mathbf{q}(X)$ equal to $\mathbf{q}(x)$ if $X = x$ is assumed to be an
$s$-dimensional random vector. The function $\mathbf{q}$ may be termed a sample probability
predictor. It is easily seen that $\hat{\mathbf{p}}_{YC}$, $\hat{\mathbf{p}}_{YC\alpha}$,
$\hat{\mathbf{p}}_{Y \cdot X}$, and $\hat{\mathbf{p}}_{Y \cdot X\alpha}$ are all sample
probability predictors. Many more complex sample probability predictors may be
constructed by use of log-linear models (Gilula & Haberman, 1995b).

To assess the value of the sample probability predictor $\mathbf{q}$, the expected penalty is
evaluated. The penalty under study is a random variable $L(Y, \mathbf{q}(X))$ that depends on
the observations $X$ and $Y$ under study and on the sampled variables $X_h$ and $Y_h$. To find
the expected penalty efficiently requires several arguments involving conditional expectations.
Given $X = x$ and given the observed $X_h$ and $Y_h$, the conditional expected value of the
penalty $L(Y, \mathbf{q}(X))$ from probability prediction of $Y$ from $\mathbf{q}$ is the random
variable

\[ D^*(\mathbf{p}_{Y \cdot X}(x), \mathbf{q}(x)) \ge J_{Y \cdot X}(x). \]

Let

\[ F(\mathbf{q}|x) = D^*(\mathbf{p}_{Y \cdot X}(x), \mathbf{q}(x)) - J_{Y \cdot X}(x) \ge 0 \]

denote the conditional excess expected penalty given $X = x$ and the $X_h$ and $Y_h$. Then

\[ F(\mathbf{q}|x) = \sum_{y \in S} p_{Y \cdot X}(y|x)[L(y, \mathbf{q}(x)) - L(y, \mathbf{p}_{Y \cdot X}(x))] \]

(Haberman, 1982a). Given $X = x$, the conditional expected excess penalty is

\[ A(\mathbf{q}|x) = E(F(\mathbf{q}|x)), \]

and the expected excess penalty is

\[ B(\mathbf{q}) = E(A(\mathbf{q}|X)). \]

The expected penalty is then

\[ I(\mathbf{q}) = B(\mathbf{q}) + J_{Y \cdot X}. \]

In the case of squared error, $F(\mathbf{q}|x)$ is

\[ F_C(\mathbf{q}|x) = \sum_{y \in S} [q(y|x) - p_{Y \cdot X}(y|x)]^2, \]

and $A(\mathbf{q}|x)$ is

\[ A_C(\mathbf{q}|x) = \sum_{y \in S} \{\sigma^2(q(y|x)) + [E(q(y|x)) - p_{Y \cdot X}(y|x)]^2\} \]

(Haberman, 1982a). The expected excess penalty is then

\[ B_C(\mathbf{q}) = E(A_C(\mathbf{q}|X)), \]

and the expected penalty is

\[ I_C(\mathbf{q}) = B_C(\mathbf{q}) + C_{Y \cdot X}. \]

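In the case of the logarithmic probability penalty, $F(\mathbf{q}|x)$ is

\[ F_H(\mathbf{q}|x) = \sum_{y \in S} p_{Y \cdot X}(y|x) \log\left[\frac{p_{Y \cdot X}(y|x)}{q(y|x)}\right] \]

(Haberman, 1982a), so that $F_H(\mathbf{q}|x)$ is $\infty$ if $q(y|x)$ is 0 for some $y$ such
that $p_{Y \cdot X}(y|x) > 0$. The expected excess penalty $B(\mathbf{q})$ is then

\[ B_H(\mathbf{q}) = E(A_H(\mathbf{q}|X)), \]

and the expected penalty is

\[ I_H(\mathbf{q}) = B_H(\mathbf{q}) + H_{Y \cdot X}. \]

Both $B_H(\mathbf{q})$ and $I_H(\mathbf{q})$ may be infinite.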

For misclassification rate, $F(\mathbf{q}|x)$ becomes $F_M(\mathbf{q}|x)$, where
$F_M(\mathbf{q}|x)$ is the difference between $|\mathbf{p}_{Y \cdot X}(x)|_\infty$ and the
average of the $p_{Y \cdot X}(y|x)$ for $y$ in $S$ such that
$q(y|x) = |\mathbf{q}(x)|_\infty$. The conditional expected excess penalty given $X = x$ becomes

\[ A_M(\mathbf{q}|x) = E(F_M(\mathbf{q}|x)), \]

the expected excess penalty becomes

\[ B_M(\mathbf{q}) = E(A_M(\mathbf{q}|X)), \]

and the expected penalty is

\[ I_M(\mathbf{q}) = B_M(\mathbf{q}) + M_{Y \cdot X}. \]
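
For a fixed value of x, the three excess penalties compare a predicted vector with the true conditional probability vector. The following sketch is illustrative; the function names and the example vectors are assumptions, not values from the report.

```python
import math

def excess_squared(p, q):
    """Squared error excess: squared distance between q and p."""
    return sum((qi - pi) ** 2 for pi, qi in zip(p, q))

def excess_log(p, q):
    """Logarithmic excess: infinite if q gives probability 0 to a possible value."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

def excess_misclassification(p, q, tol=1e-12):
    """max(p) minus the average true probability of the values that maximize q."""
    top = max(q)
    chosen = [pi for pi, qi in zip(p, q) if abs(qi - top) <= tol]
    return max(p) - sum(chosen) / len(chosen)

p = [0.2, 0.5, 0.3]               # hypothetical true conditional probabilities
close = [0.25, 0.45, 0.30]        # slightly off, same most probable value
wrong_mode = [0.5, 0.3, 0.2]      # picks the wrong most probable value
has_zero = [0.0, 0.6, 0.4]        # assigns 0 to a possible value

for q in (close, wrong_mode, has_zero):
    print(excess_squared(p, q), excess_log(p, q), excess_misclassification(p, q))
```

The misclassification excess is zero whenever the predictor ranks the true most probable value first, however inaccurate the predicted probabilities are, whereas the other two criteria respond to every discrepancy.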


3 Penalty Criteria in Large Samples 

The large-sample behavior of the expected excess penalty is very different for
misclassification rate penalty than for squared error penalty or for logarithmic probability
penalty. As evident in Section 2, because the logarithmic probability penalty can be
infinite, some differences exist for results for logarithmic penalty functions and squared
error penalty functions. In the cases of squared error penalty and logarithmic probability
penalty, large-sample properties have been explored previously in the case of log-linear
models (Haberman, 1982a; Gilula & Haberman, 1995b); however, a more precise and
simpler discussion of differences is provided by examination of the case of
$\hat{\mathbf{p}}_{Y \cdot X}$ and $\hat{\mathbf{p}}_{Y \cdot X\alpha}$ for $T$, the range of $X$,
a finite set with $v$ elements, and for the probability $p_X(x)$ that $X = x$ positive for each
$x$ in $T$. To facilitate comparison of results for different choices of $X$, it is helpful to
use the minimum expected value

\[ g = n \min_{x \in T} p_X(x) \]

of the counts $f_X(x)$. In the case of squared error penalty,
$B_C(\hat{\mathbf{p}}_{Y \cdot X})$ and $B_C(\hat{\mathbf{p}}_{Y \cdot X\alpha})$ are both of
order $g^{-1}$. In the case of logarithmic probability penalty,
$B_H(\hat{\mathbf{p}}_{Y \cdot X})$ is infinite, and $B_H(\hat{\mathbf{p}}_{Y \cdot X\alpha})$ is
of order $g^{-1}$. In the case of misclassification rate penalty,
$B_M(\hat{\mathbf{p}}_{Y \cdot X})$ and $B_M(\hat{\mathbf{p}}_{Y \cdot X\alpha})$ are equal and
are typically of order $\exp(-\beta g)$ for some real $\beta > 0$.

To verify these claims, a few basic results concerning the distribution of the frequency
counts $f_X(x)$ should be noted. The probability that $f_X(x) = 0$ is
$r(x) = [1 - p_X(x)]^n$. Note that $\log(d) < d - 1$ if $d$ is a positive real number other
than 1. It follows that $r(x)$ is equal to

\[ \exp\{n \log[1 - p_X(x)]\} < \exp[-n p_X(x)] \le \exp(-g), \]

so that convergence of this probability to 0 is exponentially fast for each $x$ in $T$. It is
also helpful to note that

\[ m(x) = E(1/f_X(x) \mid f_X(x) > 0) \]

is bounded below by

\[ u_1(x) = \frac{1 - k(x)}{(n + 1) p_X(x)} \]

for

\[ k(x) = \frac{n p_X(x) r(x)}{1 - r(x)} \]

and bounded above by

\[ u_1(x) + 3 u_2(x) \]

for

\[ u_2(x) = \frac{u_1(x) - k(x)/2}{(n + 2) p_X(x)} \]

(Stephan, 1945). Thus $n p_X(x) m(x)$ differs from 1 by a term of order $g^{-1}$.

Given these preliminaries, each penalty function requires separate attention. 


3.1 Squared Error 


In the case of squared error penalty, both $B_C(\hat{\mathbf{p}}_{Y \cdot X})$ and
$B_C(\hat{\mathbf{p}}_{Y \cdot X\alpha})$ are well approximated by

\[ n^{-1} \sum_{x \in T} C_{Y \cdot X}(x) \le g^{-1} C_{Y \cdot X}. \]

Thus the expected excess penalty is of order $g^{-1}$. Although the details of verification of
this claim are a bit complicated, the basic principles are readily summarized.

The simplest case to consider is $\hat{\mathbf{p}}_{Y \cdot X}$. In this instance, given that
$f_X(x) > 0$, the conditional expectation of $\hat{p}_{Y \cdot X}(y|x)$ is
$p_{Y \cdot X}(y|x)$. Given that $f_X(x) = 0$, the conditional expectation of
$\hat{p}_{Y \cdot X}(y|x)$ is

\[ P(Y = y \mid X \ne x) = \frac{p_Y(y) - p_{Y \cdot X}(y|x) p_X(x)}{1 - p_X(x)}. \]

It follows that

\[ E(\hat{p}_{Y \cdot X}(y|x)) = [1 - r(x)] p_{Y \cdot X}(y|x) + r(x) P(Y = y \mid X \ne x) \]

differs from $p_{Y \cdot X}(y|x)$ by a term of order $\exp(-g)$. In like fashion, the
conditional variance of $\hat{p}_{Y \cdot X}(y|x)$ given $f_X(x) > 0$ is

\[ p_{Y \cdot X}(y|x)[1 - p_{Y \cdot X}(y|x)]/f_X(x), \]

and the conditional variance of $\hat{p}_{Y \cdot X}(y|x)$ given $f_X(x) = 0$ is

\[ P(Y = y \mid X \ne x)[1 - P(Y = y \mid X \ne x)]/n. \]

It follows that

\[ n p_X(x) \sigma^2(\hat{p}_{Y \cdot X}(y|x)) - p_{Y \cdot X}(y|x)[1 - p_{Y \cdot X}(y|x)] \]

is of order $g^{-1}$. Because

\[ \sum_{y \in S} p_{Y \cdot X}(y|x)[1 - p_{Y \cdot X}(y|x)] = C_{Y \cdot X}(x), \]

it follows that

\[ B_C(\hat{\mathbf{p}}_{Y \cdot X}) - n^{-1} \sum_{x \in T} C_{Y \cdot X}(x) \]

is of order $g^{-2}$. Because $C_{Y \cdot X}(x)$ cannot exceed $1 - s^{-1}$ and $g \le n/v$,
$B_C(\hat{\mathbf{p}}_{Y \cdot X})$ is of order $g^{-1}$. Slight changes in arguments can be used
to show that

\[ B_C(\hat{\mathbf{p}}_{Y \cdot X\alpha}) - n^{-1} \sum_{x \in T} C_{Y \cdot X}(x) \]

is also of order $g^{-2}$. Note that $C_{Y \cdot X}(x)$ is positive if distinct $y$ and $y'$
exist such that the conditional probabilities $p_{Y \cdot X}(y|x)$ and $p_{Y \cdot X}(y'|x)$
are both positive.

For the probabilities in Table 1, if n = 1,000, then g is 23, and the expected excess 
squared error penalty is about 0.00378. For n = 10,000, g is 230, and the expected excess 
is about 0.000378. 
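
The approximation for squared error can be evaluated directly from Table 1 by summing the conditional concentrations over the machine scores and dividing by n. The sketch below is illustrative; the function names are assumptions, and rows of `table` are machine scores while columns are human scores.

```python
# Joint probabilities from Table 1: rows are machine scores, columns human scores.
table = [
    [0.0080, 0.0120, 0.0010, 0.0010, 0.0005, 0.0005],
    [0.0240, 0.0440, 0.0760, 0.0160, 0.0010, 0.0010],
    [0.0040, 0.0440, 0.1000, 0.0760, 0.0040, 0.0010],
    [0.0010, 0.0080, 0.0960, 0.1760, 0.0680, 0.0010],
    [0.0010, 0.0040, 0.0040, 0.0400, 0.0800, 0.0480],
    [0.0005, 0.0005, 0.0010, 0.0010, 0.0240, 0.0320],
]

def conditional_concentration(row):
    """C_{Y.X}(x) = 1 - sum_y p(y|x)^2 for one row of joint probabilities."""
    total = sum(row)
    return 1.0 - sum((p / total) ** 2 for p in row)

def approx_excess_squared_error(table, n):
    """Large-sample approximation n^{-1} * sum_x C_{Y.X}(x)."""
    return sum(conditional_concentration(row) for row in table) / n

for n in (1000, 10000):
    g = n * min(sum(row) for row in table)
    print(f"n={n}: g={g:.0f}, approximate excess={approx_excess_squared_error(table, n):.6f}")
```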


3.2 Logarithmic Probability 

In the case of logarithmic probability, $\hat{p}_{Y \cdot X}(y|x)$ is 0 with positive
probability for a case with $p_{Y \cdot X}(y|x) > 0$ unless some function $c$ on $T$ exists for
which $Y = c(X)$. In this trivial case, the expected excess penalty
$B_H(\hat{\mathbf{p}}_{Y \cdot X})$ is 0. Otherwise, $B_H(\hat{\mathbf{p}}_{Y \cdot X})$ is
infinite, although it should be noted that, for any positive real $d$, the expected minimum of
$d$ and $F_H(\hat{\mathbf{p}}_{Y \cdot X})$ approaches $v(s - 1)/n$ whenever each conditional
probability $p_{Y \cdot X}(y|x)$ is positive (Gilula & Haberman, 1995b).

More interesting results are available if $\hat{\mathbf{p}}_{Y \cdot X\alpha}$ is considered. It
is simplest to confine attention to the case in which $p_{Y \cdot X}(y|x)$ is always positive.
To facilitate comparison of results for different choices of $X$, the condition may be used that
a positive real $d$ exists such that $p_{Y \cdot X}(y|x) > d$ for all $y$ and $x$. The basic
result obtained is quite simple, for the expected excess penalty
$B_H(\hat{\mathbf{p}}_{Y \cdot X\alpha})$ differs from $v(s - 1)/n$ by a term of order $g^{-2}$.
This result is predictable given current literature (Gilula & Haberman, 1995b). It should be
noted that, for sufficiently large sample sizes, $B_H(\hat{\mathbf{p}}_{Y \cdot X\alpha})$ is at
least $s$ times larger than $B_C(\hat{\mathbf{p}}_{Y \cdot X\alpha})$.
The argument required to prove results for the logarithmic penalty is rather similar to
that used for squared error; however, a bit more effort is required because
$\log \hat{p}_{Y \cdot X\alpha}(y_i|x)$ is a nonlinear function of $\hat{p}_{Y \cdot X}(y_i|x)$
that may be as small as $\log(\alpha_i/(n + \alpha_+))$. Details are not considered here;
however, it is worth noting that two basic principles are involved. For
$p_{Y \cdot X}(y|x) > 0$, the logarithms
$\log[\hat{p}_{Y \cdot X\alpha}(y|x)/p_{Y \cdot X}(y|x)]$ are approximated by

\[ \frac{\hat{p}_{Y \cdot X\alpha}(y|x) - p_{Y \cdot X}(y|x)}{p_{Y \cdot X}(y|x)} \]

by use of the elementary expansion

\[ \log(b/a) = \frac{b - a}{c} \]

for $b$ and $a$ real and positive and $c$ a real number between $b$ and $a$. To place limits on
the probability that $|\hat{p}_{Y \cdot X\alpha}(y|x) - p_{Y \cdot X}(y|x)|$ exceeds some small
quantity $\delta$, large deviation theory is used (Bahadur & Ranga Rao, 1960). Let $a$ and $b$
be positive real numbers less than 1, and let

\[ d = a \log\left(\frac{a}{b}\right) + (1 - a) \log\left(\frac{1 - a}{1 - b}\right). \]

Let $f$ be a binomial random variable with sample size $k > 0$ and probability $b$. If $a > b$,
then the probability that $f/k \ge a$ does not exceed $\exp(-kd)$. If $a < b$, then the
probability that $f/k \le a$ does not exceed $\exp(-kd)$.
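
The large-deviation bound for a binomial proportion can be compared with the exact tail probability. The sketch below is illustrative; the function names and the particular values of k, a, and b are hypothetical choices made only to exercise the inequality.

```python
import math

def rate(a, b):
    """d = a log(a/b) + (1 - a) log((1 - a)/(1 - b)) for 0 < a, b < 1."""
    return a * math.log(a / b) + (1.0 - a) * math.log((1.0 - a) / (1.0 - b))

def upper_tail(k, b, a):
    """Exact probability that f/k >= a for f binomial with size k and probability b."""
    start = math.ceil(k * a - 1e-12)  # small slack guards against rounding in k * a
    return sum(math.comb(k, j) * b ** j * (1.0 - b) ** (k - j)
               for j in range(start, k + 1))

k, b, a = 50, 0.3, 0.5  # hypothetical values with a > b
exact = upper_tail(k, b, a)
bound = math.exp(-k * rate(a, b))
print(f"exact tail = {exact:.5f}, bound exp(-kd) = {bound:.5f}")
```

The bound holds for every sample size, but it can be several times larger than the exact probability, which is why refined approximations are also of interest.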

For n = 1,000 and for probabilities defined as in Table 1, the expected excess 
logarithmic probability penalty is about 0.03. If n is 10,000, then the expected excess 
penalty is reduced to 0.003. As expected, these values are somewhat larger than the 
corresponding ones for squared error penalty. 
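
As a worked check of these figures, assuming that the leading term v(s - 1)/n is used with v = 6 machine-score values and s = 6 human-score values from Table 1, the arithmetic is as follows.

```latex
\[
  \frac{v(s - 1)}{n} = \frac{6 \times 5}{1{,}000} = 0.03,
  \qquad
  \frac{v(s - 1)}{n} = \frac{6 \times 5}{10{,}000} = 0.003 .
\]
\[
  \frac{0.03}{0.00378} \approx 7.9 \ge s = 6 .
\]
```

The second line compares the leading term with the corresponding squared error figure for n = 1,000 and agrees with the statement that the logarithmic excess is at least s times larger in large samples.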

3.3 Misclassification Penalty

For misclassification penalty,
$B_M(\hat{\mathbf{p}}_{Y \cdot X}) = B_M(\hat{\mathbf{p}}_{Y \cdot X\alpha})$ because
$\hat{p}_{Y \cdot X}(y|x) > \hat{p}_{Y \cdot X}(y'|x)$ if, and only if,
$\hat{p}_{Y \cdot X\alpha}(y|x) > \hat{p}_{Y \cdot X\alpha}(y'|x)$ for $y$ and $y'$ in $S$ and
$x$ in $T$. In addition, $B_M(\hat{\mathbf{p}}_{Y \cdot X})$ is trivially 0 if $Y$ satisfies the
equiprobability condition that $p_{Y \cdot X}(y|x) = s^{-1}$ for all $y$ and $x$. In other cases,
large-deviation theory may be applied to obtain an upper bound on
$B_M(\hat{\mathbf{p}}_{Y \cdot X})$. One finds that $B_M(\hat{\mathbf{p}}_{Y \cdot X})$ is of
order $\exp(-\beta g)$ for some $\beta > 0$. As will be evident from examination of Table 1,
this exponential rate of convergence to 0 does not necessarily imply that
$B_M(\hat{\mathbf{p}}_{Y \cdot X})$ is negligible even for relatively large samples.


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To examine the required large-deviation theory, let p(y, x) be the probability that 
X = x and Y = y, so that p(y,x ) = Px(x)py.x(y\x). Let Pyx(v |^) > PY-x(y'\x), and let 


7 (y,y'\x) = {[PY-x(y\x)} 1/2 - [PY-x(y'\x)) 1/2 } 2 . 


For real nonnegative a and b, a > b, 

( a V2_ 6 l/2 )(a l/2 + 6 l/2 )=a _^ 


It follows that 

7 (y,y'\x) > ^\pv.x(y\x) -p Y .x{y'\x)} 2 . 

if 

rn(y,y'\x) = 1 -p(y,x) -p(y',x) + 2 [p(y,x)p(y',x)} 1/2 , 
v(y, y'\x) = p(y , x) + p(y\ x) - \p(y, x) - p(y\ x )} 2 , 


and 


then 


C (y,y'\x) 


2[1 -p(y',x)/p(y,x)] x , s = 2, 
{1 - \p(y', x)/p(y, x)] l / 2 Y l , s > 1 , 


m{y,y'\x) = 1 - p x (x)^(y, y'\x) < 1 , 


the probability $,(y,y'\x) that f Y x(y',x) > f Y x(y,x) does not exceed [m(y, y'\x)] n , and 
^(y,y'\x) is well approximated by 


to(.y,y'\x) 


[m(y,y'\x)] n 

[2Tmv(y,y'\x)} 1 ^((y,y'\ x ) 


in the sense that 


£(y,y'\x) -£ 0 (y,y'\x) 

£(y,y'\x) 

is of order g -1 (Bahadur & Ranga Rao, 1960). Use of the inequality d 
positive real d ^ 1 implies that 


1 < log(d) for 


m(y,y'\x)] n < exp[-np x (xy(y, y'\x) . 
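As a rough numerical illustration of these quantities, the following sketch evaluates $\gamma$, $m$, $v$, $\zeta$, and $\xi_0$ for a single pair of responses and compares the bound $m^n$ with the exact probability of misordering. The exact probability is computed by a simple dynamic program over the difference of the two counts; that device and all function names are illustrative assumptions, not part of the original derivation.

```python
from math import exp, pi, sqrt


def pair_quantities(p_y, p_yp, p_x, n, s):
    """Quantities for responses y, y' at X = x, given conditional
    probabilities p_y > p_yp > 0 and marginal probability p_x of x."""
    gamma = (sqrt(p_y) - sqrt(p_yp)) ** 2
    joint_y, joint_yp = p_x * p_y, p_x * p_yp          # p(y, x) and p(y', x)
    m = 1 - joint_y - joint_yp + 2 * sqrt(joint_y * joint_yp)
    v = joint_y + joint_yp - (joint_y - joint_yp) ** 2
    ratio = joint_yp / joint_y
    zeta = 2 / (1 - ratio) if s == 2 else 1 / (1 - sqrt(ratio))
    xi0 = m**n * zeta / sqrt(2 * pi * n * v)
    return {"gamma": gamma, "m": m, "v": v, "zeta": zeta, "xi0": xi0}


def exact_misorder_probability(p_y, p_yp, p_x, n):
    """Exact P(f(y', x) > f(y, x)) for n independent observations.

    Each observation moves the count difference by +1 (cell (y', x)),
    by -1 (cell (y, x)), or leaves it unchanged (any other cell).
    """
    up, down = p_x * p_yp, p_x * p_y
    stay = 1 - up - down
    dist = {0: 1.0}
    for _ in range(n):
        new = {}
        for diff, prob in dist.items():
            for step, q in ((1, up), (-1, down), (0, stay)):
                new[diff + step] = new.get(diff + step, 0.0) + prob * q
        dist = new
    return sum(prob for diff, prob in dist.items() if diff > 0)


if __name__ == "__main__":
    n = 200
    q = pair_quantities(5 / 8, 3 / 8, 0.5, n, s=2)
    print("m =", q["m"], " 1 - p_X * gamma =", 1 - 0.5 * q["gamma"])
    print("exact =", exact_misorder_probability(5 / 8, 3 / 8, 0.5, n))
    print("m^n =", q["m"] ** n, " exp bound =", exp(-n * 0.5 * q["gamma"]))
    print("xi0 =", q["xi0"])
```

The inputs 5/8, 3/8, and 1/2 are chosen to match the dichotomous example with $v = 2$ discussed in Section 3.4.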


In addition, for any $x$ in $T$,

$$F_M(\hat{p}_{Y\cdot X}|x) \le \sum_{y \in S} [|p_{Y\cdot X}(x)|_\infty - p_{Y\cdot X}(y|x)]u(y|x),$$

where $u(y|x)$ is 1 if $\hat{p}_{Y\cdot X}(y|x) \ge \hat{p}_{Y\cdot X}(y_i|x)$ for each $i$ in $A(p_{Y\cdot X}(x))$ and $u(y|x)$ is 0
otherwise. Let $p_{Y\cdot X}(c(x)|x) = |p_{Y\cdot X}(x)|_\infty$. It follows that

$$B_M(\hat{p}_{Y\cdot X}) \le \sum_{x \in T} p_X(x) \sum_{y \in S} [|p_{Y\cdot X}(x)|_\infty - p_{Y\cdot X}(y|x)][r(x) + \xi(c(x), y|x)].$$

Thus real $r > 0$ and $\beta > 0$ exist such that $B_M(\hat{p}_{Y\cdot X})$ is less than $r\exp(-\beta g)$ for $g$
sufficiently large. The size of $\beta$ is at least one quarter the square of the smallest difference
$p_{Y\cdot X}(c(x)|x) - p_{Y\cdot X}(y_i|x)$ for $i$ not in $A(p_{Y\cdot X}(x))$ and $x$ in $T$, and $r$ can be selected not
to exceed $2(s - 1)$. The bound is quite generous, as is evident from the more accurate
approximation to $\xi(y, y'|x)$.

For the probabilities in Table 1, if n = 1,000, then use of upper bounds shows that
the expected excess penalty does not exceed 0.0129, but the refined approximation yields
0.00434. For n = 10,000, the upper bound is 0.0000959, and the refined approximation is
0.0000140. Note that the value for n = 1,000 is quite comparable to that for squared error,
but the expected excess for n = 10,000 is very small. In general, the exponential rate of the
convergence to 0 of $B_M(\hat{p}_{Y\cdot X})$ implies, at least for a large enough sample size, that a much
smaller expected excess penalty is achieved for misclassification penalty than is achieved in
the case of squared error or logarithmic probability penalty.


3.4 Comparison of Expected Penalties 


For an additional simple illustration of the implications of the large-sample properties
of the sample probability predictors under study, consider $s = 2$, and define a uniformly
distributed random variable $W$ with range (1/4, 3/4) such that the conditional probability
that $Y = 1$ given that $W = w$ is $w$. Let $v$ be a positive integer, and let $X$ be the largest
integer not greater than $2v(W - 1/4)$, so that $p_X(x) = v^{-1}$ for integers $x$ from 0 to $v - 1$,
and

$$p_{Y\cdot X}(1|x) = \frac{1}{4} + \frac{2x + 1}{4v}.$$

Straightforward calculations show that, in the case of squared error, the conditional
concentration of $Y$ given $W$ is

$$C_{Y\cdot W} = 2\int_{1/4}^{3/4} 2w(1 - w)\,dw = \frac{11}{24},$$

the conditional concentration of $Y$ given $X$ is

$$C_{Y\cdot X} = \frac{11}{24} + \frac{1}{24v^2},$$

and the expected excess penalties $B_C(\hat{p}_{Y\cdot X})$ and $B_C(\hat{p}_{Y\cdot X\alpha})$ are well approximated by
$vC_{Y\cdot X}/n$. As $n$ approaches $\infty$ and $v/n$ approaches 0, the expected penalties $I_C(\hat{p}_{Y\cdot X})$ and
$I_C(\hat{p}_{Y\cdot X\alpha})$ are well approximated by

$$\left(\frac{11}{24} + \frac{1}{24v^2}\right)(1 + v/n).$$

At this point, there is a tradeoff to consider. More categories $v$ in the definition of $X$ lead
to a smaller conditional dispersion $C_{Y\cdot X}$ but a larger approximation for the expected excess
penalties $B_C(\hat{p}_{Y\cdot X})$ and $B_C(\hat{p}_{Y\cdot X\alpha})$. For $n$ large, the optimal situation has $v$ approximately
equal to $(3n/11)^{1/3}$, so that the expected penalties $I_C(\hat{p}_{Y\cdot X})$ and $I_C(\hat{p}_{Y\cdot X\alpha})$ are well
approximated by

$$C_{Y\cdot W} + \frac{1}{6}\left(\frac{11}{3n}\right)^{2/3}.$$

Note that for sufficiently large $n$, the expected penalties from use of probability predictors
from sample data become increasingly close to the expected penalty achieved through
prediction of $Y$ by $W$ under the condition that $p_{Y\cdot W}$ is known. The difference in expected
penalties is of order $n^{-2/3}$. As an illustration of results, consider n = 1,000. In this case, $v$,
which must be an integer, may be taken as 6, and $I_C(\hat{p}_{Y\cdot X})$ and $I_C(\hat{p}_{Y\cdot X\alpha})$ exceed $C_{Y\cdot W}$ by
about 0.0039.
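The tradeoff in $v$ can be examined directly by evaluating the large-sample approximation $(11/24 + 1/(24v^2))(1 + v/n)$ over integer values of $v$. The sketch below does only that; it is a search over the stated approximation, not over the exact expected penalty, and the function names and the search limit `v_max` are hypothetical.

```python
from fractions import Fraction

C_YW = Fraction(11, 24)  # conditional concentration of Y given W


def approx_penalty_sq(v, n):
    """Large-sample approximation (11/24 + 1/(24 v^2)) (1 + v/n)."""
    return (C_YW + Fraction(1, 24 * v * v)) * (1 + Fraction(v, n))


def best_v_sq(n, v_max=50):
    """Integer v in 1..v_max that minimizes the approximation."""
    return min(range(1, v_max + 1), key=lambda v: approx_penalty_sq(v, n))


if __name__ == "__main__":
    n = 1000
    for v in range(4, 9):
        print(v, float(approx_penalty_sq(v, n) - C_YW))
    print("minimizing integer v:", best_v_sq(n))
```

For n = 1,000 the search selects $v = 6$, with an excess over $C_{Y\cdot W}$ that rounds to the 0.0039 quoted in the text.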

In like manner, for the logarithmic penalty, the conditional entropy of $Y$ given $W$ is

$$H_{Y\cdot W} = -2\int_{1/4}^{3/4} [w\log(w) + (1 - w)\log(1 - w)]\,dw = \frac{1}{2} + 2\log(2) - \frac{9}{8}\log(3) = 0.650,$$

and the conditional entropy $H_{Y\cdot X}$ of $Y$ given $X$ is well approximated by

$$H_{Y\cdot W} + \frac{1}{48v^2}\int_{1/4}^{3/4} [w(1 - w)]^{-1}\,dw = H_{Y\cdot W} + \frac{1}{24v^2}\log(3).$$


It follows that the expected penalty $I_H(\hat{p}_{Y\cdot X\alpha})$ is well approximated by

$$H_{Y\cdot W} + \frac{1}{24v^2}\log(3) + \frac{v}{2n}.$$

To reduce expected penalty from use of $\hat{p}_{Y\cdot X\alpha}$, the optimal choice of $v$ for large $n$ has $v$
approximately equal to $(4^{-1}n\log 3)^{1/3}$, so that $I_H(\hat{p}_{Y\cdot X\alpha})$ is well approximated by

$$H_{Y\cdot W} + \frac{2}{3}\left(\frac{\log 3}{4n^2}\right)^{1/3}.$$

For sufficiently large $n$, the expected penalty from use of sample data becomes increasingly
close to the expected penalty achieved through prediction of $Y$ by $W$ under the condition
that $p_{Y\cdot W}$ is known. As in the case of concentration, the difference in expected penalties is
of order $n^{-2/3}$. For n = 1,000, the optimal choice of $v$ is 6, and the expected penalty for the
sample predictor exceeds the expected penalty for $p_{Y\cdot W}$ by 0.0043. Thus results for squared
error and logarithmic penalty are quite similar.
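The entropy figures can be reproduced in the same spirit. The sketch below evaluates $H_{Y\cdot W}$ by a midpoint rule, compares it with the closed form $1/2 + 2\log(2) - (9/8)\log(3)$, and tabulates the approximate excess $\log(3)/(24v^2) + v/(2n)$ over integer $v$. The quadrature rule, step count, and function names are illustrative assumptions.

```python
from math import log


def entropy_given_w(steps=20_000):
    """Midpoint-rule value of -2 * integral over (1/4, 3/4) of
    w*log(w) + (1 - w)*log(1 - w); the factor 2 is the uniform density."""
    h = 0.5 / steps
    total = 0.0
    for i in range(steps):
        w = 0.25 + (i + 0.5) * h
        total += w * log(w) + (1 - w) * log(1 - w)
    return -2 * total * h


def approx_excess_log(v, n):
    """Approximate excess over H_{Y.W}: log(3)/(24 v^2) + v/(2 n)."""
    return log(3) / (24 * v * v) + v / (2 * n)


def best_v_log(n, v_max=50):
    """Integer v in 1..v_max that minimizes the approximate excess."""
    return min(range(1, v_max + 1), key=lambda v: approx_excess_log(v, n))


if __name__ == "__main__":
    print("numerical H_{Y.W}:", entropy_given_w())
    print("closed form      :", 0.5 + 2 * log(2) - 9 / 8 * log(3))
    n = 1000
    v = best_v_log(n)
    print("minimizing integer v:", v, " excess:", approx_excess_log(v, n))
```

For n = 1,000 the search again gives $v = 6$, and the excess rounds to the 0.0043 stated above.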

The situation is very different for misclassification penalty. Here the conditional
misclassification rate $M_{Y\cdot W}$ of $Y$ given $W$ is easily seen to be (1/2 + 1/4)/2 = 3/8, and the
conditional misclassification rate $M_{Y\cdot X}$ of $Y$ given $X$ is 3/8 for $v$ even and $3/8 + 1/(8v^2)$
for $v$ odd. In terms of the expected penalty $I_M(\hat{p}_{Y\cdot X}) = I_M(\hat{p}_{Y\cdot X\alpha})$, the optimal choice of $v$
is 2. In this case, the expected penalty is very close to $M_{Y\cdot W} + (1/4)\xi(2, 1|1)$, and $\xi(2, 1|1)$
is bounded above by $0.9841^n$. It follows that the expected excess misclassification penalty
is less than $10^{-7}$ if n = 1,000, a figure drastically smaller than the corresponding values for
squared error penalty or logarithmic probability penalty.
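These figures follow from the definitions in a few lines. The sketch below computes $M_{Y\cdot X}$ exactly from $p_{Y\cdot X}(1|x) = 1/4 + (2x + 1)/(4v)$ and evaluates the base $m = 1 - p_X(x)\gamma$ of the large-deviation bound for $v = 2$. The function names are hypothetical.

```python
from fractions import Fraction
from math import sqrt


def misclassification_rate(v):
    """Exact M_{Y.X}: average over x of min(p, 1 - p),
    where p = p_{Y.X}(1|x) = 1/4 + (2x + 1)/(4v) and p_X(x) = 1/v."""
    total = Fraction(0)
    for x in range(v):
        p1 = Fraction(1, 4) + Fraction(2 * x + 1, 4 * v)
        total += min(p1, 1 - p1)
    return total / v


def bound_base(v, x):
    """m = 1 - p_X(x) * gamma for the two responses at X = x."""
    p1 = 0.25 + (2 * x + 1) / (4 * v)
    larger, smaller = max(p1, 1 - p1), min(p1, 1 - p1)
    return 1 - (1 / v) * (sqrt(larger) - sqrt(smaller)) ** 2


if __name__ == "__main__":
    for v in range(1, 7):
        print(v, misclassification_rate(v))
    m = bound_base(2, 1)
    print("m for v = 2:", m)
    print("(1/4) m^n for n = 1000:", 0.25 * m**1000)
```

The even and odd cases of $M_{Y\cdot X}$ can be read off the printed fractions, and the last line gives the bound on the expected excess penalty for n = 1,000.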


4 Conclusions 

The large-sample properties associated with squared error penalty, logarithmic
probability penalty, and misclassification penalty indicate that misclassification penalty
exhibits behavior very different from that of the other penalties. One might think that the
asymptotic results imply a superiority of misclassification penalty on the grounds that, for
a sufficiently large sample size, if each conditional probability $p_{Y\cdot X}(y|x)$ is positive, then the
expected excess penalty is smaller for misclassification penalty than for the other choices of
penalty functions. In reality, the apparent advantage of misclassification penalty reflects a
very serious flaw in the criterion. The misclassification rate is very insensitive to variations
in predicted probabilities unless two or more predicted probabilities are nearly the same.
For example, in the example of prediction of a dichotomous response, the predictor $X$ for
$v = 2$ is as effective as the original variable $W$ in terms of misclassification rate despite
the substantial variability of the conditional probability $p_{Y\cdot W}(y|w)$ as a function of $w$. On
the other hand, the differences $C_{Y\cdot X} - C_{Y\cdot W} = 1/96$ and $H_{Y\cdot X} - H_{Y\cdot W} = 0.011$ are fairly
substantial.

Table 2.
Joint Probabilities of Human and Machine Scores

Machine                          Human score
score         1        2        3        4        5        6    Total
1        0.0110   0.0120   0.0000   0.0000   0.0000   0.0000   0.0230
2        0.0100   0.0760   0.0760   0.0000   0.0000   0.0000   0.1620
3        0.0000   0.1000   0.1000   0.0290   0.0000   0.0000   0.2290
4        0.0000   0.0000   0.1740   0.1760   0.0000   0.0000   0.3500
5        0.0000   0.0000   0.0000   0.0800   0.0800   0.0170   0.1770
6        0.0000   0.0000   0.0000   0.0000   0.0270   0.0320   0.0590
Total    0.0210   0.1880   0.3500   0.2850   0.1070   0.0490   1.0000

The example of essay scoring exhibits a similar issue. The same conditional
misclassification rate is achieved if Table 1 is modified to yield Table 2. Table 1 and Table 2
are quite different. Other measures reflect the change. The conditional concentration $C_{Y\cdot X}$
is changed from 0.6266 to 0.5457, and the conditional entropy $H_{Y\cdot X}$ is changed from 1.1851
to 0.8873. The latter two measures reflect the decreased dispersion in Table 2 relative to
Table 1.
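The Table 2 figures can be recomputed from the joint probabilities. The sketch below conditions on the column variable of the table, with natural logarithms; this direction is an assumption, adopted because it is the one that reproduces the reported values 0.5457 and 0.8873. The function name is hypothetical.

```python
from math import log

# Joint probabilities from Table 2 (rows and columns as printed, totals omitted).
TABLE_2 = [
    [0.0110, 0.0120, 0.0000, 0.0000, 0.0000, 0.0000],
    [0.0100, 0.0760, 0.0760, 0.0000, 0.0000, 0.0000],
    [0.0000, 0.1000, 0.1000, 0.0290, 0.0000, 0.0000],
    [0.0000, 0.0000, 0.1740, 0.1760, 0.0000, 0.0000],
    [0.0000, 0.0000, 0.0000, 0.0800, 0.0800, 0.0170],
    [0.0000, 0.0000, 0.0000, 0.0000, 0.0270, 0.0320],
]


def conditional_measures(joint):
    """Return (concentration, entropy, misclassification rate) of the row
    variable given the column variable (assumed conditioning direction)."""
    concentration = entropy = misclassification = 0.0
    for j in range(len(joint[0])):
        column = [row[j] for row in joint]
        total = sum(column)
        if total == 0:
            continue
        # p_X(x) * [1 - sum_y p(y|x)^2]
        concentration += total - sum(p * p for p in column) / total
        # -sum_y p(y, x) * log p(y|x)
        entropy -= sum(p * log(p / total) for p in column if p > 0)
        # p_X(x) * [1 - max_y p(y|x)]
        misclassification += total - max(column)
    return concentration, entropy, misclassification


if __name__ == "__main__":
    c, h, m = conditional_measures(TABLE_2)
    print("concentration:", round(c, 4))
    print("entropy:", round(h, 4))
    print("misclassification rate:", round(m, 4))
```

Applying the same function to the Table 1 probabilities would allow a direct check of the claim that the misclassification rate is unchanged while the other two measures differ.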

In practice, attempts to use misclassification penalty rather than more sensitive penalty 
functions are likely to obscure actual improvements in prediction. For example, progress in 
the machine scoring of essays can be expected to be obscured as long as criteria based on 
misclassification rates are employed. 